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Abstract 

A new iterative algorithm, the multi-level algorithm, for the numerical solution of steady 
state Markov chains is presented. The method utilizes a set of recursively coarsened rep 
resentations of the original system to achieve accelerated convergence. It is motivated by 
multigrid methods, which are widely used for fast solution of partial differential equations. 
Initial results of numerical experiments are reported, showing significant reductions in com- 
putation time, often an order of magnitude or more, relative to the Gauss-Seidel and optimal 
SOR algorithms for a variety of test problems. The paper also contrasts and compares the 
multi-level method with the iterative aggregation-disaggregation algorithm of Takahashi. 
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1. Introduction 


Markov systems generated by computer modeling tools such as queueing network, Petri nets, 
or reliability modeling packages may contain hundreds of thousands of states. The resulting 
sparse linear systems of equations have a correspondingly large number of unknowns and must, 
in general, be solved numerically using an iterative scheme. Typical methods are the Power, 
Gauss-Seidel, and SOR methods. All of these methods have the drawback that they may require 
many iterations to reach a solution, particularly if the system is large, or a if high degree of 
accuracy is required. This can lead to unacceptably long computation times. 

A similar situation is found when solving partial differential equations, where systems of 
many hundreds of thousands of unknowns are not uncommon. Here, however, a relatively 
new algorithm - the multigrid method - has met with considerable success, achieving, under 
appropriate conditions, substantially improved solution speeds compared to traditional methods 
such as Gauss-Seidel and SOR. One should more accurately consider multigrid to be a class of 
methods, as the basic framework allows a wide variety of choices for each of the constituent 
components. The method to be presented in this paper is in many respects related to this 
class of algorithms: the Markov system is recursively coarsened and values obtained from these 
smaller systems are used to achieve faster convergence. 

In this paper we present a new solution algorithm, the Multi-Level algorithm, for Markov 
chains. Our initial experiments indicate the multi-level algorithm does not require the Markov 
chain to have any special structure in order to achieve excellent performance, although if such 
structure does exist it may be possible to exploit that structure to achieve even better perfor- 
mance. We can offer no proof of convergence, but in all experiments we have run the method 
always converges. The convergence theory for multigrid methods is relatively limited, applying 
largely to equations of elliptic type. However the methods are widely used on all classes of linear 
and non-linear PDEs. We present experimental results for the Gauss-Seidel, SOR, and Multi- 
Level algorithms when applied to Markov chains generated from birth-death processes, finite 
population tandem queueing networks, blocking (finite capacity) tandem queueing networks, 
and a canonical stochastic Petri-net model. 

For purposes of brevity we will refer to the Gauss-Seidel algorithm as the GS algorithm, and 
the Multi-Level algorithm as the ML algorithm. Note that the phrase ’’multi-level algorithm is 
also used to denote a class of methods related to multigrid. These bear a structural resemblance 
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to the scheme presented here in that they make nse of coarser subproblems to achieve acceler- 
ated solution; they are, however, otherwise unrelated. The remainder of the paper is structured 
as follows. In the following section, after some preliminary remarks, the multi-level method is 
described. In section 3 we describe related work and compare and contrast the multi-level algo- 
rithm with existing algorithms. In section 4 results of experiments comparing the performance 
of the method to GS, SOR, and the algorithms of Takahashi for a variety of test problems are 
presented. In section 5 we discuss the practical aspects of implementation of the multi-level 
method and provide a list of possible directions for future research. In the final section we 
summarize the paper. 

2. Multi-Level Solution Algorithm 

In this section we explain our new ML solution algorithm. We first summarize classical multi- 
grid techniques in section 2.1. We then review classical Markov chain aggregation techniques in 
section 2.2. In section 2.3 we give a detailed description of our ML algorithm. 

2.1 Multigrid Methods 

Multigrid algorithms are a relatively recent development in the field of iterative solvers for 
large systems of equations, dating from the late seventies [1]. They were originally applied to the 
systems of equations that arise from the discretization of elliptic boundary value problems, and it 
is lor these equations that most multigrid theory has been developed. For this class of problems 
multigrid algorithms are among the fastest known solvers, being of optimal complexity, i.e. 
having computation times that are linear in the size of the input. An introduction to multigrid 
algorithms may be found in [2], [7] and [17]. 

Multigrid algorithms begin by defining a set of increasingly coarse representations (grid levels) 
of the original problem, each of which has only a fraction of the number of degrees of freedom 
as its predecessor. The algorithm uses a standard iterative procedure such as GS at each grid 
level to quickly reduce error components that are high frequency w.r.t. that level ( smoothing ). 
A smoothing sweep through all grid levels efficiently reduces errors across the entire spectrum. 
In addition to the smoother, operators are required that transfer information from a fine to a 
neighboring coarse level {restriction) and vice versa ( prolongation ). 

Multigrid algorithms work most efficiently on regularly structured elliptic problems, whose 
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coefficients vary smoothly between neighboring unknowns. In such cases they can achieve an 
error reduction of an order of magnitude per iteration. Conversely, multigrid algorithms for 
problems that are non-elliptic, unstructured or have rapidly varying coefficients do not in general 
perform as well and currently represent an active field of research. Markov chains can possess 
one or more of the above characteristics. One branch of multigrid research which attempts to 
deal with general sparse systems is known as algebraic multigrid, see [11]. 

Given an appropriate choice of smoother, multigrid algorithms can be parallelized with a 
high degree of efficiency, see [8] for a recent survey. This is, of course, an added advantage in 
the context of modern supercomputer architectures and networked workstations. 

The algorithm to be presented in section 2.3 may be viewed as a multigrid-like algorithm. 
However, owing to the absence of a grid structure and because of the difference in approach 
to multigrid schemes, we will refer to the algorithm more abstractly as a multi-level algorithm. 
Because of the similarities between the algorithms, we nevertheless expect that many of the 
established multigrid ideas can be applied to the Markov chain solver, and this indeed proves to 
be the case. 

2.2 Aggregation of Markov Chains 

Consider a steady state continuous time Markov chain consisting of n states sj . . . s n . Denote 
the vector of unknowns by p, where p; is the probability of being in state s, . 

We then have to solve the system of equations 

Pp = 0 (1) 

with the additional condition 

2 pi = 1 

i=l 

Note, equation (1) is simply a reformulation of the classic continuous time Markov chain 
equation: 

ttQ = 0 (3) 

where P is the transpose of the generator matrix Q, and p is the transpose of steady state 
probability vector x. Equations (1) and (2) form a sparse linear system which is typically solved 
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Figure 1 : Aggregation of Markov Chains 


numerically using the GS or SOR algorithm. These schemes suffer the drawback of needing a 
large number of iterations when n is large or when a high degree of accuracy is required. 

A coarser representation of the Markov chain described by P may be obtained by aggregation. 
This means creating a new Markov chain Q with the vector of state probabilities < 7 , each of whose 
N states S\ . . . Syv is derived from a small number of states of P . Figure 1 illustrates the situation 
for an eight-state birth/death chain (P), where states are aggregated in pairs to form a four- 
state coarser level system (Q), which in turn is pairwise aggregated to form the coarsest level 
two-state system (R). 

In the following we will use the terms fine level and coarse level to refer to Markov chains 
where the latter is obtained by aggregation from the former. The relation s*. € 5, signifies that 
the fine level state s * is mapped by the aggregation operation to the coarse level state S{. 

The matrix Q of the aggregated system may be chosen as follows : 





( 4 ) 


This is the classical aggregation equation. Note that the matrix Q is a function not only of the 
fine level matrix P, but also of the fine level solution vector p. 

This yields the aggregated equation in the unknown q: 


Qq = 0 


( 5 ) 
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with the additional condition 


N 


2 * = 1 • 


( 6 ) 


It can then be shown that 

«, = E n , (7) 

SkGSi 


i.e. the solution q of the aggregated system truly represents a coarser version of the solution p 
of the original problem. The probability of being in state qi is the sum of the probabilities of 
being in any of its constituent fine-level states. 

A further relation between the solutions of coarse and fine systems is given by the disaggre- 
gation equation : 


i=N E Pl p ‘k 
* — 1 . /-<? 


s i 


( 8 ) 


This equation reduces to (1) when (7) is applied. 

One may proceed recursively to obtain yet further coarsenings, arriving eventually at some 
coarsest system which may consist of only two states. 

The idea behind the ML algorithm is to use approximations obtained on coarser systems to 
improve the current approximation to the fine solution. We therefore use the coarser representa- 
tion to obtain a correction to p. One iteration of the ML algorithm will proceed in the multigrid 
manner from the original, finest system down to the coarsest, setting up coarse level equations 
and performing GS smoothing, and then back up to the finest level, computing and applying 
corrections. Note, however, that other orderings of processing the various levels are possible. 

2.3 Description of the Algorithm 

We adopt the following abbreviations for vectors a, b, c £ %t m : 

a = b * c = a,- = b{ * Cj, 1 < i < m 

a - b/c = a, = bi/ci, 1 < i < m 


We enter the (n + l)th iteration of the ML algorithm with the current approximation to the 
solution obtained as a result of the (n)th iteration, whereby p(°) denotes the initial guess, 
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and begin by performing one or more sweeps of the GS algorithm, obtaining the vector p: 

p = GS(pW) . ( 9 ) 

We assume throughout that application of GS includes a subsequent normalisation step to 
enforce (2). The vector p will not, in general be the solution p of (1), but we may write 

P*P* = P , (10) 

where p* is the elementwise multiplicative correction necessary to p. Knowledge of p* would 
immediately enable us to compute the solution p. We may write (1) as 

P(P*P*) = 0 . ( 11 ) 


Since p has been smoothed by application of the GS algorithm, we assume that it no longer 
contains any high frequency error components, and thus that p* is smooth. Therefore we may 
compute an approximation to p * on a coarsened system, since the dimension of the latter is 
smaller, and thus the computation will be cheaper. We write a coarsened version of (11) as 

Q(q*q*) = 0 , (12) 

where Q is the matrix of the aggregated system and q* and q represent aggregated representations 
of p* and p, respectively. 

The coarse system matrix Q is chosen to be an approximation to the the matrix Q from (4), 
replacing p by p, since p is not available until the algorithm has converged: 



E Pk 1 

( E Pki) 




E Pk 


s k eSi 


(13) 


In the case of a converged solution, we will, however, have p = p and therefore the correct 
coarse matrix Q — Q. 

In order to obtain q in (12) we require an operator that maps a fine level vector to the 
coarser, aggregated vector. This operator will be denoted by R (from the multigrid restriction 
operation), and we write 

q = R{p) . (14) 
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We choose summation for iZ, in accordance with (7): 


q = R(p) = qi= Yj P* 

»*€5i 


( 15 ) 


This choice for R has the property 

q = R(p ) , 

i.e. the exact fine level solution is mapped by R to the exact coarse level solution. It is clear 
that this property is necessary, since at convergence, both (1) and (5) must be satisfied. 


We proceed by defining 


q = q*q* , 


(16) 


thus obtaining the coarse level equations to be solved: 


Qq = 0 , 


N 


= 1 


(17) 


Solving (17) for q will therefore enable us to compute q*, the coarse approximation to the 
correction via (16): q* = q/q. 

We compute the fine-level correction from its coarse approximation using the operator / 
(interpolation): 

P* = 7(0 • (18) 


We choose the following operation for I: 

P* = /(<?*) = Pl = q* s k eSi . (19) 


We then compute the new iterate p( n+1 ) using 

p(”+ 1 )=p=C(p,p*) = p*p * . (20) 

If the algorithm converges, we hope that p( n+1 ) will be a better approximation to p than p( n \ 
In general we have pl”'!"*) ^ p, since the correction p* was computed only approximately on a 
coarse grid, using an incorrect matrix Q ^ Q. 
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By analogy with the SOR scheme, which computes an over- corrected iterate compared to the 
underlying GS algorithm, we may consider using an overcorrection for the ML scheme: 

V* = /(?*) = pt = q:(u + (l-u)q:) Sk € S', , (21) 

where we set 0 < u> < 1. For u < 1 such an operation will overdo the correction, since values 
q * < 1 will be decreased and values of q * > 1 will be enlarged. The parameter u thus plays an 
analogous role to that of the over-relaxation parameter in the SOR scheme. It is to be hoped 
that, as is the case for the SOR scheme, certain choices of a? may lead to improved solution 
efficiency. We will consider the utility of over- correction in section 4.2. 

The relationship between the correction and the disaggregation equation can be seen by 
comparing the r.h.s. of (8) after application of (14): 

i=N 

5Z Yh p‘ Pik 

i=l s t €Si 

and after the correction (20): 


Y Y p‘ Pik 

t=l 

= EE 

«=i i,e5i 

After the iteration has converged, we have 

q* = l N , p* = r , 

where 1 denotes the vector (1, 1, . . ., 1) T , i.e. no further correction takes place. We then also 
have Q = Q and therefore q = q. 

Note that the question of assigning fine nodes to their coarse counterparts is still open. This 
we will call the coarsening strategy or aggregation strategy. The aggregation strategy can have 
a significant effect on the performance of the ML algorithm. In cases where we have some 
knowledge of the structure of the Markov chain, for example when queueing networks are to be 
modelled, then we may utilize this information in the construction of the aggregated system. In 
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other cases, mapping strongly coupled fine states to the same aggregated state seems to be an 
efficient strategy. 

Note that the successive application of (18) followed by (14) fulfills 

R(I(q)) = cq C € U, V q 6 . , (22) 

1. e. a coarse vector q is invariant up to a constant factor under the operation of prolongation 
followed by restriction. An analogous relationship between the prolongation and restriction 
operators in a classical multigrid algorithm is frequently required. Furthermore the composite 
coarse grid correction operator C(p, I{q*)) preserves the relative probabilities of all fine level 
states mapped to the same coarse level state by aggregation: 

p = C(p , /(?*)) ^ ^ s kl ,s k2 eSi, 1 < i < N . 

Pki Pk2 

The ftuo-level version of the ML iteration is given by the sequence of steps (9), (14), (17), (18), 
(20). The multi- level algorithm is obtained by recursive application of the two-level algorithm to 
obtain a solution to the aggregated equation (17) and is described in algorithmic form in Figure 

2. We use the subscript / to denote level of representation (/ — Imax finest level, l = 0 coarsest 
level). The coarse level l - 1 and fine level l between which the operators / and R map are 
identified by appropriate indices. Note that, because of the recursive nature of the algorithm, 
the unknowns q *, q and q are represented by the variables p*_ 1? p/_ i and p/_ i, respectively. 

The Multi-Level algorithm is non-linear, owing to the use of the coarsened system obtained 
via (13), although the original problem (1) is linear. It seems therefore unlikely that theoretical 
results can be obtained for estimates of the convergence speed of the algorithm for general 
problems. 

We allow in general the possibility of applying GS v times at each level with v > 1, denoted 
by GS". We also consider the possibility of a more complex cycle form (in particular F- and 
W-cycles, see the multigrid literature), obtained by multiple recursive calls to procedure mss. 

3. Related Work 

Currently most performance tools requiring the solution of Markov chains use either the 
power, GS, or SOR algorithms. In a paper by Stewart and Goyal [15] the various techniques are 
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procedure mss(l) 
if (/ = 0) 

solve Pipi = 0 
else 

pi = GS" (.pi) 

Pi- x ■ Ri-i,i{pi) 

mss (/ — 1) 
p*— i * pi-i/pt-i 
Pj * //-X,/(p*_l) 
P/ * C{pi,p*) 
return 


Figure 2: Multi-Level Algorithm 

compared and SOR with dynamic tuning of the relaxation parameter emerges as the method 
of choice. Initial results of the ML algorithm show that it generally outperforms optimal SOR, 
often by on order of magnitude or more, without any parameters to tune. 

Our work is related to a large body of work on aggregation-disaggregation techniques. Most 

previous work using aggregation makes the assumption that the number of aggregate states, A, 

is much less than the number of states, n, i.e. N <C n. In our algorithm we generally assume 

N - n 
iy 2 ' 

In addition, much of the related work assumes that the Markov chains being solved are 
generalized birth-death processes [16, 14], or that the Markov chains are nearly completely 
decomposable systems [5, 6] In the latter case the solution is usually an approximation often 
accompanied by bounds on the error. We refer the reader to [12] for descriptions of these special 
Markov chain structures and for a more comprehensive list of references. Our work differs in 
that it does not require any special structure in the Markov chain, and the result is exact, not 
an approximation. 

The work that most strongly resemble ours is the algorithm of Takahashi [16] and its variants 
[12]. We subsequently use the terminology derived in the previous section. The Takahashi 
algorithm starts with an initial iterate for the fine level chain. The fine level chain of n states is 
then aggregated into a coarse Markov chain of N states, where N <C n using equation (13). The 
coarse chain is then solved (equation (17)), and the correction obtained from the coarse chain 
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is applied to the previous iterate in the fine level. A new iterate at the fine level is obtained 
as follows. The fine level states are grouped together into N sets of states where the members 
of each set correspond to each aggregate state in the coarse level. The set of linear equations 
corresponding to each of these sets of states is solved independently of the other fine level states 
by treating the values for the other states as constants. The new iterate at the fine level is the 
result of solving the equations for each set of states. The algorithm iterates between the two 
levels until sufficient accuracy is obtained. 

We may view the Takahashi algorithm as a special case of ML, obtained by the following 
choices, compared to the ML scheme we prefer: 

1. Use of only two levels of representation of the system, rather than multiply coarsened 
problems. 

2. N < n, as opposed to N = £ for a small c (typically 2 or 4). 

3. Block Gauss-Seidel or Block Jacobi as a smoothing procedure on the finer level, as opposed 
to a small number of steps of a pointwise Gauss-Seidel scheme. 

4. Use of post- rather than pre-smoothing, i.e. processing the fine-level after, rather than 
before the coarse-level solve. 

It is our claim that 1, 2 and 3 above are not the best choices. Point 4 proves to have little 
effect, as is also the case for classical multigrid methods, and is in fact irrelevant in the context 
of performing many iterations in succession. It is therefore more a conceptual difference. In 
section 4.3 an experimental result disproves point 2 above. In point 3, we observe that solving 
entire subsystems of size n/N can be prohibitively expensive and contend that solutions can 
be more efficiently obtained by pointwise GS applied on a larger set of more slowly coarsened 
problems. 

We restate that our motivation for the ML algorithm was not to modify current aggregation- 
disaggregation algorithms, but rather to devise a algorithm similar to multi-grid algorithms 
which have shown exceptional merit in solving elliptic boundary value problems. We find it 
helpful to not view our algorithm as a Markov chain aggregation-disaggregation variant, but 
instead view it as a multigrid-like scheme. 
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4. Experimental Results 

In this section we present experimental results to show how our new algorithm compares 
with GS and SOR. All experiments presented assume continuous time Markov chains. We 
have also solved discrete time Markov chains with similar improvements in performance relative 
to SOR and GS. In section 4.1 we first compare ML to GS and SOR for a variety of test 
problems assuming an unsophisticated implementation of the ML algorithm. In section 4.2 
we demonstrate the potential of improving ML performance via a few techniques including 
intelligent aggregation, varying the number of smoothing steps at each level, over- correction, 
and F and W cycles. In section 4.3 we present experimental results comparing ML to both the 
Takahashi and nested Takahashi algorithms [16]. 

4.1 Generic Multi-Level Results 

In this section we present our generic ML results. By generic we mean that the ML algorithm 
used is the simplest one possible, a V cycle (each iteration goes from the finest level down to 
the coarsest level and back up to the finest level), no overcorrection, only applying one iteration 
of the smoother (GS) at each level, and a simple aggregation strategy. In particular, we assume 
states of the Markov chain are ordered 0 ... n - 1 , and that the aggregation policy is by pairing 

neighboring states by index: s, € Sj & I = |^j. If a level has an odd number of states then 

the last state is included in the last coarse level state. Unlike SOR, this ML algorithm has no 
parameters to tune. In all of our experiments we give the benefit of the doubt to SOR and assume 

we can find the optimal relaxation parameter We find u> to the nearest by using a 

binary search between 1.0 and 2.0. This results in over- optimistic metrics for the SOR algorithm 
since in practice u? must be found via dynamic tuning, thus resulting in additional iterations. By 
presenting best case results for SOR and worst case results for ML, we strengthen our argument 
that ML may be a more promising solution algorithm than SOR. In all experiments we assume 
the initial iterate is set to the vector (^, . . . , We also assume the ML algorithm recursively 
coarsens the set of equations until the final coarse sytem has only two states. 

Possible metrics for comparing the algorithms are the number of iterations, the process time, 
the number of floating point operations (flops), and the geometric mean of the convergence rate. 
The number of flops was computed by inserting a counter into all three programs. The process 


12 




time is obtained from the unix system call timesQ and is the CPU time used while executing 
instructions in the user space of the programs. In all cases we have found the process time to 
be a more conservative measure than flops, i.e. the comparison of ML to GS and SOR is less 
favorable when using process times than when using flops. Hence to strengthen our arguments 
of the utility of the ML algorithm we chose process time as our primary metric. Note that the 
process time also includes time for generation of the ML structure, whereas the flops metric 
would miss this factor. In addition, the flops metric does not capture the additional pointer 
operations needed by the ML algorithm for accessing elements in the coarse levels. To enable 
quicker comparison of the algorithms we also plot the ratio of process time for SOR and GS 
relative to the process time for ML. We also consider the number of iterations necessary for 
convergence. In general, the ML algorithm requires far fewer iterations than the other two 
algorithms, but consumes more time per iteration since each iteration requires one application 
of the smoother at each level and the construction of the coarse level matrices Q . 

We first consider a birth-death Markov chain with a finite number of states. We vary the the 
number of states assuming a birth rate of 1 and a death rate of 2. The results are presented 
in figure 3. The number of iterations increases linearly with the number of states for the GS 
and SOR algorithms, whereas the number of iterations remains fixed at 21 iterations for the ML 
algorithm. Intuitively, probability mass only moves slowly through the system in the GS and 
SOR algorithms, whereas one iteration of the ML algorithm can move mass from one end of the 
chain to the other via the coarse level problems. 

The number of flops and the process time of SOR and GS increase quadratically for SOR and 
GS with the number of states, whereas both metrics increase only linearly for ML. Thus ML 
is an optimal method for this particular problem. The GS (SOR) algorithm requires 257 (128) 
times more processing time than ML for a birth-death chain of 10,000 states. The ratios increase 
with system size. Even for small birth death chains, such as 1,000 states, the ML algorithm is 
more than an order of magnitude faster than GS and SOR. 

One possible reason that the GS and SOR algorithms require more time as the number of 
states is increased is that they must move probability mass a longer distance in the solution 
vector. To determine whether it is this distance or the overall number of states we conduct the 
following experiment. We assume a birth death chain with each state having two additional 
exiting transitions beyond the birth and death. State s t - has a transition to state s,-+ m and a 
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transition to state s t _ m . If (i + m) > n — 1 then the transition is to state n — 1. Similarly, if 
{i — m) < 0 then the transition is to state sq. We initially set the number of states equal to 500 
and rn equal to 2. We then successively double the number of states and the distance m . We 
define the diameter of a Markov chain to be the maximum distance (or number of transitions) 
between any two states. Hence, in this experiment, regardless of the number of states, the 
diameter is fixed at 250. We assume all transitions in the birth direction proceed at rate 1, and 
all transitions in the death direction are at rate 2. 

The results from this experiment are presented in figure 4. The number of iterations necessary 
for SOR and GS initially rises with an increase in the number of states, but then levels off. 
Conversely, the process time steadily increases, hence more work is done per iteration as the size 
increases. It appears that both the size and diameter of the Markov chain influence convergence 
speed of GS and SOR. The number of iterations for the ML algorithm varies from 19 to 25. Thus 
the diameter of the chain does not appear to have much effect on the solution speed (measured 
in iterations) of the ML algorithm. 

We next investigate the sensitivity of the relative performance of the algorithms to the ratio 
of birth rate to death rate. We fix the birth rate at 1 and vary the death rate from 0.001 to 
1000. The number of states is equal to 10,000. The results are presented in figure 5. Note that 
both axes are scaled logarithmically. 

The performance of the GS algorithm is always worse than that of the ML algorithm. The 
SOR algorithm performs better than ML when the death rate is less than the birth rate, but 
one to two orders of magnitude worse when the roles are reversed. In the best case observed 
(for SOR), SOR requires ~ of the process time of ML. Note that the computation times for GS 
and SOR could be made symmetric by exchanging the birth and death rates or by reordering 
the states. The ML algorithm does not require any such special techniques to achieve good 
performance, and hence is more resilient to changes in transition rates. 

The excellent performance of SOR when the death rate is less than the birth rate is surprising. 
In fact, it is not realistic . For a ratio of | or less SOR converges in less than 30 iterations. We 
were able to achieve this excellent performance by first determining the optimal u> to the nearest 

by applying SOR many times with u values chosen in a binary search. In practical 
situations the optimal value of u> is not known a priori and the solution will be calculated only 
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once. Hence u> must be obtained via some dynamic procedure. It is impractical to assume 
that an SOR algorithm including dynamic tuning of u can converge in only 30 iterations. In 
fact, in a recent paper proposing a dynamic method for determining u>, [3] , 30 iterations are 
executed before tuning of u even begins. The paper notes that often thousands of iterations 
were necessary to find the optimal u. Thus, the excellent performance of SOR in figure 5 could 
never be achieved in practice. We conjecture that ML will perform better than any existing 
dynamic tuning SOR algorithm for the entire parameter range in this experiment. 

We next explore Markov chains generated from simple queueing models. We first assume a 
closed system tandem queue model. The queueing system is shown in figure 9. We assume a 
finite population where jobs think for an exponential period of time with rate A (i.e. the jobs 
visit an infinite server and are served at rate A). Jobs are served in queues 1 and 2 at rates 
\i\ and p 2 respectively, and upon leaving queue 2 return to queue 1 with probability p. The 
states of the Markov chain are generated from an initial state of all jobs in the think state, 
and then by constructing the chain in a breadth-first fashion. States are numbered 0 through 
n — 1 as they are created in the breadth first search. The aggregation policy used in the ML 
algorithm is to pair adjacent-numbered states. Hence, the performance of the ML algorithm is 
almost certainly sub-optimal, since no intelligent aggregation techniques are being used. In all 
experiments reported we fix the think rate to 1.0. 

In the first experiment we set p, the probability that a job leaving queue 2 returns to queue 
1, to 0.0, pi to 1.0, and p 2 to 2.0 and vary the population from 25 to 250. This results in a 
range of 351 to 31,626 states in the underlying Markov chain. The results of the experiment are 
plotted in figure 6. The ML algorithm requires far less computation time for the solution than 
the other algorithms, especially as the population increases. 

We next consider the effect of the relative values of pi and p 2 on the performance of the 
algorithms. We fix the think rate to 1.0, probability p to 0.0, the population to 100 (resulting 
in 5151 states in the underlying Markov chain), set p i to 1.0, and vary p 2 from 0.001 to 1000. 
The results are plotted in figure 7. The ML policy results in lower computation times than the 
other algorithms across the entire parameter space, and when pi > P 2 the solution time is an 
order of magnitude faster. Experiments with larger populations demonstrate a more pronounced 
difference in the performance of ML relative to GS and SOR. 
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We finally consider the effect of probability p, i. e. the probability that a job leaving queue 2 
is returned to queue 1. We set the think rate to 1.0, the population to 150 (11,476 states), and 
vary p for three different settings of p\ and p 2 . The performance of ML is much less sensitive 
to the ratio of the rates and to p than GS or SOR. When pi — p 2 = 1.0, GS ranges from 6 to 
12 times slower than ML, and SOR ranges from 2 to 6 times slower than ML. When p\ = 1.0 
and p 2 = 2.0, GS ranges from 6 to 12 times slower than ML, and SOR ranges from 10 times 
slower to 1.5 times faster than ML. When p\ = 2.0 and p 2 = 1-0, GS ranges from 5.3 to 2.2 
times slower than ML, and SOR ranges from 2.7 times slower to 1.5 times faster than ML. 

We now consider the application of the three algorithms to the solution of the underlying 
Markov chain of a canonical stochastic Petri net. Figure 11 shows the complexities, measured 
in KFLOPs, of the GS, SOR and ML algorithms applied to the underlying Markov chain of the 
stochastic Petri net of Molloy [9], depicted in Figure 10, with populations ranging from 10 to 
60. The Markov chains for this problem were generated using the SPNP ( stochastic Petri net 
package ) tool of Ciardo et al [4]. ML is superior to GS for all cases tested, the improvement 
being a factor of approximately 3.3 for the smallest and approximately 16.6 for the largest case 
considered. Against SOR with an optimally chosen overrelaxation parameter, ML performs 
slightly worse only for the smallest problem considered. 

4.2 Multi-Level Acceleration Techniques 

In this section we consider a few techniques that can be applied to the ML algorithm to 
further accelerate convergence speed. Some of these techniques (F and W cycles, application of 
the smoother multiple times at each level) are borrowed from the multigrid literature. There are 
other multigrid techniques that may prove to be useful to our ML algorithm. The over-correction 
idea is directly analogous to over-relaxation in SOR. 

We first consider how more intelligent aggregation of the Markov chain can affect the per- 
formance of the ML algorithm. We consider the same tandem queueing network in section 4.1, 
figure 9, except that we assume finite capacity (blocking) queues with a capacity of c and set 
p — 0. We assume the queues are finite capacity to facilitate the ease of an intelligent aggre- 
gation technique. Figure 12 shows on the left the Markov chain generated by this modified 
tandem queue. The states of the Markov chain may be written as a two-dimensional lattice of 
size (c+l)x(c+l). The transitions then form a regular pattern, somewhat analogous to the 


16 



grid of a discretized PDE. We assume the states to be numbered lexicographically from top left 
to bottom right and for simplicity that c+ 1 be a power of two. The simple aggregation strategy 
used would successively pair states that are adjacent horizontally until no longer possible and 
then pairwise vertically, as illustrated on the upper right. An alternative strategy more appro- 
priate to the structure of the problem is shown on the lower right, where fine level states are 
grouped into 2x2 units. 

Table 1 shows the number of floating point operations needed by the ML algorithm applied to 
the Markov chain of Figure 12. We compare the one-dimensional, pairwise aggregation strategy 
( 1-D ) with the two-dimensional case ( 2-D ). We consider from one to three smoothing steps with 
GS (v = 1, 2, 3). For comparison, GS requires 30.1 MFLOPs and SOR with an optimal choice 
for u requires 14.1 MFLOPs. 



1 - D 

2-D 

V 

1 2 

3 

1 2 

3 

MFLOPs 

12.7 9.3 

8.2 

5.4 5.2 

4.9 


Table 1: Effect of v and aggregation strategy on the ML algorithm. 

It can be clearly seen that the two-dimensional aggregation strategy is significantly faster 
than the simpler scheme. The former can be also be improved (by about 33%) by performing 
additional relaxation steps, whereas the latter algorithm, which is already superior, improves to 
a lesser extent. In this experiment the best-case ML scheme achieves a speedup of 6.1 over GS 
and 2.9 over SOR. 


U 

1.0 

0.9 

0.8 

0.7 

0.6 

0.5 

0.4 

0.3 

0.2 

0.1 

0.0 

MFLOPs 

5.38 

5.21 

4.67 

4.14 

3.78 

3.60 

3.25 

3.25 

3.07 

2.89 

3.07 


Table 2: Effect of overcorrection on operation count. 

Table 2 shows the effect of the ’’overcorrection” according to (21) on the operation count 
of the ML scheme applied to the same problem, where u> ranges from 1.0 to 0.0. A significant 
improvement is found to be achievable, the optimal value of u giving rise to an improvement of 
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approximately 53% over the unmodified correction. The ML scheme with the optimal overcor- 
rection and adapted aggregation strategy thus requires less than 1/10 (1/5) of the floating point 
operations of the GS (SOR) algorithm. 

Figure 13 shows the operation counts of the Multi-Level algorithm applied to the simple 
birth-death chain of length 512 with birth and death rates of 100 and 101 respectively. We 
consider the V-, F- and W-cycle types with varying degrees of over correction. For values of 
u) — l, i.e. without over correct ion, it can be seen that both W and F cycles are superior, despite 
their higher per iteration cost. All cycle types show improvements with overcorrection, the V 
cycle being most sensitive, exhibiting divergence when a ; is chosen to be too small. This behavior 
is, of course, well known with the SOR algorithm where a slightly overestimated value of u can 
lead to fast divergence. The V cycle is improved by a factor of 16.8, the F cycle by 6.7 and 
the W cycle by a factor of 5.87. All ML variants are approximately equally efficient when their 
optimal u> is chosen. 

Figure 14 shows comparable results for SOR applied to the same problem. Here, however, a 
logarithmic scale is used. We include the F cycle result of figure 13 for comparison purposes. 
From the figure the sensitivity of SOR to the choice of u> is clearly demonstrated: the range for 
which fast convergence is achieved is extremely narrow. By comparison, the F cycle is much 
more robust, i.e. it is more tolerant of inaccurate values of w. The unmodified ML algorithm 
is 186 times faster than unmodified SOR (i.e. GS), and the optimal ML algorithm is still 8.7 
times faster than the optimal SOR. 

4.3 Comparison with Takahashi's Algorithm 

In this section we present results from initial comparisons with the iterative aggregation-dis- 
aggregation algorithms of Takahashi [16] as specified in the survey paper by Schweitzer [13]. 
As outlined in section 3, the Takahashi algorithm aggregates the fine level chain of n states 
into a coarse Markov chain of N states. We define the aggregation factor to be the number 
of fine level states aggregated into each coarse state. We consider a birth-death chain of 4096 
states with a birth rate of 2.0 and a death rate of 1.0. We solve the chain using the Takahashi 
algorithm with aggregation factors of 2, 4, 8, and 16. We could not get the algorithm to 
converge for larger aggregation factors. The ratio of the Takahashi solution time over the ML 
solution time is plotted in figure 15. The computation time for ML is constant since we fix the 
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aggregation factor at 2 for the ML algorithm. In another experiment, results not shown, we 
have observed that larger aggregation factors increase the solution time for the ML algorithm. 
The performance of the Takahashi algorithm is extremely sensitive to the aggregation factor, 
the computation times varying by more than an order of magnitude, whereas the ML algorithm 
requires no tuning of any parameters. Interestingly, Takahashi’s algorithm performs best at 
a modest ratio of n - 8 N> as opposed to the case N < n, as the algorithm was originally 
intended. In the best case, the Takahashi algorithm requires a factor of 1.8 more computation 
time. Results with the birth-death rates reversed are similar. 

We also attempted to test the nested Takahashi algorithm, i.e. the coarse chain is solved 
recursively using the Takahashi algorithm, but only succeeded in getting the algorithm to con- 
verge for a few experiments. The problem with convergence has been previously noted [13]. For 
those few experiments that convergence was achieved, the ML had a smaller computation time. 

Hence, it appears that the Takahashi algorithm could be used to achieve solution times ap- 
proaching that of the ML algorithm, but suffers the drawbacks of requiring fine tuning of the 
aggregation factor, and also convergence problems. The Takahashi algorithm was originally 
proposed to reduce the memory requirements for solving Markov chains, whereas the ML algo- 
rithm actually uses more memory than standard solution techniques, hence it may not be fair 
to compare the two without taking memory requirements into consideration. 

5. Discussion of the algorithm and the results 

Memory requirements. The implementation of the ML algorithm requires one additional 
variable at each state compared to the SOR scheme in order to store the temporary values p. In 
addition, the overall number of states needed is higher than for SOR because of the additional 
recursively aggregated systems. If the number of states of the original problem is n and this 
number is reduced by a factor of / during each aggregation step, then the overall number of 
states s needed by the ML algorithm is bounded by 

n 


Thus in the examples considered in section 4 we have / = 1/2 and therefore $ < 2 n. Smaller 
values of / lead to smaller memory requirements, and indeed we have observed improvements 
in efficiency with / = 1/4 and / = 1/8, giving s < 4n/3 and s < 8n/7, respectively. 
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Implementation effort. The ML algorithm is evidently more complex than the SOR 
scheme, additional coding being required for the the treatment of the coarse grid equations. 
The implementations used in section 4 required however, only 329 and 576 lines of C for the 
SOR and the ML algorithms respectively. Thus we consider implementation overhead not to be 
significant. 

Parallelization. The Multi-Level algorithm will parallelize well, given an appropriate choice 
of smoother. Evidently, there are serious difficulties involved with the parallel execution of the 
GS and SOR algorithms, owing to the recursive nature of these schemes. However, in the 
context of multigrid algorithms, it has been shown that the ’’multi-color” style of GS can allow 
efficient parallelization without compromising convergence speed [8]. Parallelization of ML is 
done by data partitioning within each level. With a multi-color smoother, all the operations of 
the ML method at a given level can be performed concurrently. Communication will be required 
between processors for the smoothing step and collect /broadcast operations for the convergence 
test and enforcement of (2). Although coarser levels will run less efficiently, as less computation 
is performed there, the coarse granularity of the finer levels, where most computational work is 
located, will ensure overall good parallel performance. Thus we conclude that the ML algorithm, 
too, will perform well on a multiprocessor system or workstation cluster. We are at present 
developing a parallel version of the algorithm, and hope to report on the results in the near 
future. 

Cycle types. There are other alternatives to processing the levels of aggregation in the 
downward-upward sweep used in the present scheme. These can be obtained by making the 
number of recursive calls to procedure mss in Figure 2 a function of the level number /. Thus 
coarser levels may be visited more than once during one iteration. Such techniques can, in the 
classical multigrid context, lead to improved efficiencies, as the coarse level equations are then 
more accurately solved. Experiments reported in the previous section have shown that this can 
also be the case for the ML algorithm presented here. One may furthermore consider dynamic 
cycling after Brandt [1] or adaptive cycling according to Rude [10]. 

Choice of aggregate equation. The. algebraic multigrid algorithm often defines the matrix 
of the aggregated level as a function of the finer level matrix and the prolongation and restriction 
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operators represented in matrix form: 


Q = RPI . 

In the present context this would have the advantage of not having to recompute the matrix Q 
at every iteration. However, property (7) would be lost. 

Choice of coarsening strategy. Convergence characteristics may be improved by a ju- 
dicious choice of aggregation strategy. In particular, Markov chains derived from queueing 
networks can possess a regular structure which may be exploited. Experiments have shown this 
to be the case for the tandem queue example of section 3. More work is needed to create a 
Multi-Level algorithm which automatically finds good aggregation strategies. 

Additive correction. Multigrid algorithms use the coarser grids to compute an additive 
correction, rather than a multiplicative one. Our attempts to develop such a algorithm were only 
partially successful in that occasionally, negative correction values from the coarse grid induced 
negative values in the solution at the finer level, leading to a divergent iteration. Similar results 
were observed when, by analogy with multigrid schemes, a defect was used to generate an 
inhomogeneous aggregated equation. Further work is needed to determine whether a multigrid- 
like algorithm can be found for the Markov problem. 

6. Conclusions and Outlook 

The ML algorithm presented in this paper has been shown to require significantly less com- 
putation time than the SOR scheme for a number of test problems. The difference between the 
two algorithms increases with the number of states of the Markov chain. In addition to being 
significantly faster, based on our experience to date it appears that the ML algorithm is much 
more resilient to variations in transition rates. Another nice property of the ML algorithm is 
that excellent performance can be obtained without the necessity of tuning a parameter, such 
as the over-relaxation parameter in SOR. We have shown that ML performance can be even 
further improved, up to a factor of 6 times faster in our initial experiments, by using F and W 
cycles or over- correction. 

Examination of a larger sample of cases, including Markov chains from real applications have 
yet to be made. However, we feel that the algorithm shows enough promise to justify further 
investigation into ML schemes for solving Markov chains. 
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Further work will also include the implementation and testing of the techniques mentioned 
in the previous section. In particular, attention will be paid to choosing an aggregation strategy 
for general Markov chains, where no o priori knowledge of the topology is available. Several of 
these techniques have already proven to provide improved efficiency in preliminary experiments. 

A parallel version of the ML algorithm is also in preparation, and we hope to be able to 
present results obtained on a MIMD supercomputer in the near future. 
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Computation Time in Seconds Iterations 










Figure 4: Birth Death problem with fixed diameter. Left : Computation time; Right: Number 
of iterations. 




Ratio of (deathRate / birthRate) Ratio of (deathRate / birthRate) 

Figure 5: Birth-Death chain, effect of varying rates. Left : Computation time; Right : Time 
Ratio. 


25 







Computation Time in Seconds Computation Time in Seconds 






0 0-2 M fl.fi 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 

p {probability return to queue 1} p (probability return to queue 1) p (probability return to queue 1) 

Figure 8: Tandem queue problem, variation of p, Computation times. Left : pi = p2 = 1; 
Centre : pi = 1, p2 = 2; Right : pi = 2, p2= l. 



Figure 9: Tandem Queueing Network 



Figure 10: Molloy’s Problem 
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Figure 11: Results for Molloy’s problem 
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Figure 12: Markov state space for tandem blocking queue 
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Figure 13: Effect of overcorrection for ML algorithm. Birth (rate 100) / death (rate 101) chain, 
length 512. GS requires 899 MFLOPs, SOR requires 8.48 MFLOPs. 
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Figure 14: Effect of overrelaxation/overcorrection for SOR and ML algorithms. Birth (rate 100) 
/ death (rate 101) chain, length 512. 



Figure 15: Comparison With Takahashi Algorithm, Time Ratio. 
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